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Abstract 

Background: Human leukocyte antigen (HLA) genes are critical genes involved in important biomedical aspects, 
including organ transplantation, autoimmune diseases and infectious diseases. The gene family contains the most 
polymorphic genes in humans and the difference between two alleles is only a single base pair substitution in many 
cases. The next generation sequencing (NGS) technologies could be used for high throughput HLA typing but in silico 
methods are still needed to correctly assign the alleles of a sample. Computer scientists have developed such 
methods for various NGS platforms, such as lllumina, Roche 454 and Ion Torrent, based on the characteristics of the 
reads they generate. However, the method for PacBio reads was less addressed, probably owing to its high error rates. 
The PacBio system has the longest read length among available NGS platforms, and therefore is the only platform 
capable of having exon 2 and exon 3 of HLA genes on the same read to unequivocally solve the ambiguity problem 
caused by the "phasing" issue. 

Results: We proposed a new method BayesTypingl to assign HLA alleles for PacBio circular consensus sequencing 
reads using Bayes' theorem. The method was applied to simulated data of the three loci HLA-A, HLA-B and HLA-DRB1 . 
The experimental results showed its capability to tolerate the disturbance of sequencing errors and external noise 
reads. 

Conclusions: The BayesTypingl method could overcome the problems of HLA typing using PacBio reads, which 
mostly arise from sequencing errors of PacBio reads and the divergence of HLA genes, to some extent. 

Keywords: HLA typing, NGS, PacBio 



Background 

Human leukocyte antigen (HLA) system contains a set of 
genes that encode for major histocompatibility complex 
(MHC) in humans. The main function of MHC molecules 
is to mediate interactions between antigen-presenting 
cells, various lymphocytes and other body cells; therefore, 
malfunctions of HLA may associate with certain disorders 
in the immune system, for example, drug hypersensitivity 
reactions [1] and some autoimmune diseases, e.g., type 1 
diabetes and systemic lupus erythematosus [2]. HLA also 
plays an important role in transplantation of organs or 



"Correspondence: kmchao@csie.ntu.edu.tw 
+ Equal contributors 

1 Department of Computer Science and Information Engineering, National 
Taiwan University, No.1 , Sec.4, Roosevelt Road, Taipei 1 061 7, Taiwan 
^Graduate Institute of Biomedical Electronics and Bioinformatics, National 
Taiwan University, No.1 , Sec.4, Roosevelt Road, Taipei 1 061 7, Taiwan 
Full list of author information is available at the end of the article 



stem cells [3,4] and is associated with infectious diseases 
such as HIV [5]. 

There are 10,533 HLA alleles in the IMGT/HLA 
Database [6] and the number is still increasing. The HLA 
genes are the most polymorphic genes in humans and 
the difference between two alleles is often only a sin- 
gle base pair substitution. There are two main classes 
of HLA genes. The class I HLA genes (HLA-A, -B, 
and -C) each encodes a glycoprotein chain in association 
with the monomorphic molecule /32-microglobulin on the 
cell surface of most somatic cells, and the class II HLA 
genes (HLA-DP, -DQ and -DR) each encodes an a or a 
P glycoprotein chain associated as heterodimers on the 
cell surface of antigen-presenting cells [7] . The exon 2 and 
exon 3 sequence of class I HLA genes and the exon 2 
sequence of class II HLA genes form the critical peptide- 
binding groove responsible for the specificity of peptide 
recognition and binding [7]. 
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It has been shown that high-resolution HLA match- 
ing improves survival rates of marrow transplantation [8]. 
Therefore, to identify the alleles of a sample, it is bet- 
ter to use DNA-based methods instead of serological 
approaches [9]. In addition, with the advance of the 
next generation sequencing (NGS) technologies, HLA 
typing by NGS seems to be a promising approach for 
HLA sequencing and allele assignment owing to its effi- 
ciency and cost effectiveness. [10] reviewed the latest 
approaches of template preparation, sequencing plat- 
forms and data-analysis for HLA typing by NGS. It was 
showed that the four major NGS platforms, Roche GS 
454 FLX, Illumina MiSeq/HiSeq, PGM Ion Torrent and 
Pacific Biosciences SMRT (PacBio), were all capable of 
producing sequences suitable for the resolution of HLA 
genotyping. 

However, among the four platforms, HLA typing by 
PacBio was the least addressed. For example, the mod- 
ule HLA Typing in the software Omixon Target from 
Omixon only works with Illumina, Roche 454 and Ion 
Torrent. The software NGSengine from GenDx is plat- 
form independent but is optimized for Roche 454, Ion 
Torrent, MiSeq. It might be due to the high error rate of 
PacBio (about 15-20%), which makes it more difficult to 
genotype polymorphic regions such as HLA. Moreover, 
to sequence multiple samples simultaneously, sequences 
could be labelled with barcodes for identification of sam- 
ples [11]. With a higher error rate, there are more barcode- 
calling errors and reads are more likely identified as of 
wrong samples. To our knowledge, the troublesome issue 
related to wrong barcode assignment in PacBio HLA typ- 
ing has not been addressed previously. 

Despite the high raw error rate, the PacBio system actu- 
ally has two very unique advantages for HLA genotyping. 
First, the PacBio system has the longest read length among 
available NGS platforms. According to the public docu- 
ments from PacBio and the real-world data we received 
from the PacBio machine (personal communication with 
PacBio representatives), for data generated with P4-C2 
chemistry, the average read length is about 5.5 Kb and the 
number of reads is about 50 K per SMRT cell. Although 
the class I HLA genes contain seven or eight exons as 
illustrated in Figure 1, their genotypes are determined by 
numerous (mostly) single nucleotide substitutions scat- 
tered/patched in both exon 2 and exon 3 (the trimmed 
regions in Figure 1). If the genetic variants in exon 2 and 
exon 3 are derived from different reads, then the cor- 
rect phasing of the two exon 2 sequences and two exon 
3 sequences from the same individual needs to be pre- 
dicted by some computational algorithm, which inevitably 
causes "ambiguity" [7,12]. PacBio is the only platform 
capable of having exon 2 and exon 3 of HLA genes on 
the same read to unequivocally solve the ambiguity prob- 
lem. Second, although the raw error rate of PacBio is the 
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Figure 1 Target sequences preparation. We simulated target 
sequences for alleles that have only coding sequences in the 
IGMT/HLA Database. The exons of the reference genomic sequence 
were replaced by the coding sequence of each allele. 



highest among available NGS platforms, a unique advan- 
tage of PacBio is that the errors occur randomly without 
a systemic error pattern. Therefore, with sufficient cover- 
age and appropriate error-correction techniques, the final 
assemble error rate can be one of the lowest among all the 
NGS platforms [13]. 

There are two main types of PacBio reads: continuous 
long read (CLR) and circular consensus sequencing read 
(CCS). (For a short enough insert, the PacBio system is 
capable of multi-pass sequencing the raw read and gener- 
ates a consensus sequence with higher accuracy, i.e., CCS). 
Both types of reads can be used for targeted sequenc- 
ing, which is to sequence specific areas of interest within 
the genome, (e.g. regions within the HLA genes in our 
application). The characteristics of CLR and CCS reads 
in the early days for targeted sequencing were previously 
summarized in [14] and we only excerpt a part in Table 1. 

To solve the challenging "ambiguity" related to geno- 
typing of class I HLA genes, the targeted sequences have 
to cover the whole region of exon 2 and exon 3 (and 
also intron 2 in between the two exons), which is about 
1 Kb, and the read lengths of both types fit the require- 
ment. Both types of reads are also proved to be effective in 
detecting genomic variants [14]. However, in order to save 
cost for diagnosis applications of HLA typing, it is better 
to apply the barcode multiplexing technology [11]. With a 
high error rate, the barcodes of CLR reads are much more 



Table 1 Characteristics of CLR and CCS reads for targeted 
sequencing applications excerpted from [14] 





CLR 


CCS 


Read Accuracy 


85-90% 


> 98% (3 pass) 


Maximum Mean Readlength 


2Kb 


1 Kb 


#Reads 


100 K 


15 K 



#Reads here stands for the average number of usable reads per SMRT cell for 1 
Kb insert. 
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likely to be mis-mapped and the reads are more likely 
identified as of wrong samples. Therefore, we choose CCS 
reads as the target of our method. 

One possibility to genotype HLA is to use assembly 
methods such as CAP3 and PCAP [15,16] to recover 
the two genomic sequences of a sample and compare 
them with the alleles. However, there are still problems 
using CCS reads. First, the number of reads for each 
amplicon of each sample is not high (depending on the 
number of barcodes used). Take the experiment in [11] 
for example. There were 2,352 distinct sequence prod- 
ucts (49 amplicons x 48 barcode pairs). When the insert 
size is 1 Kb, the average number of reads for each distinct 
product was only 6.38. Second, the error rate of CCS is 
still too high to distinguish two HLA alleles having only 
little base-pair difference. Third, there are still barcode- 
calling errors for CCS reads (data not shown), which could 
induce noise reads and increase the obstacles of HLA 
typing. 

To address these problems, here we propose a method 
using Bayes' theorem. Given a few CCS reads generated 
from the target sequence (of the regions containing exon 
2 and exon 3 for class I HLA genes or of the regions con- 
taining exon 2 for class II genes) of a sample, our method 
is able to correctly assign the pair of alleles of the sam- 
ple. We simulated the alleles for each sample and the 
CCS reads generated according to the alleles. Different 
levels of reads from wrong samples were added to dis- 
turb the experiments. The experimental results showed 
that our method can stand for a high percentage of noise 
reads. 

Methods 

Simulation 

The simulation follows the setting of the multiplexing 
targeted sequencing technology [11]. In each run, there 
are multiple samples with multiple amplicons sequenced 
simultaneously. We assume that the reads have been 
grouped by their samples and amplicons (loci of HLA) and 
the reads might be identified as of wrong samples due to 
barcode-calling errors. 

Alleles of the samples 

The alleles of the samples were assigned following the 
distribution of the Taiwan Minnan population [17]. We 
obtained the frequencies of alleles from the allele fre- 
quency net database [18,19]. For the alleles with zero 
frequency, we gave them 0.1% frequencies of appearance. 
In their study, only HLA- A, B and DRB1 are involved, so 
we only simulated HLA on these three loci. The table of 
frequencies for the three loci can be found in Additional 
file 1: Table SI. When implementing our methods, the fre- 
quencies were normalized to make the summation of each 
loci equal 1. 



Linkage disequilibrium was not concerned, which 
means alleles on different loci are assigned independently. 
In reality, given the probability density function of the 
pairs of alleles in a population, we can adjust our meth- 
ods by setting p{auaj) in Equation (1) accordingly. In 
addition, since the frequencies were censused for alle- 
les with 2-digit resolution, the alleles of higher resolution 
were selected with uniform distribution. To observe the 
impacts of homozygous samples on genotyping, 30% of 
the samples have two identical alleles. 

Target sequences of the alleles 

The HLA sequences were downloaded from the 
IGMT/HLA Database Release 3.15.0 [20]. Since there are 
only CDSs instead of genomic sequences for most alleles 
in the database, we created the genomic sequences and 
corresponding target sequences of our own. 

We illustrated the creation of the target sequences 
in Figure 1. First, for each locus, a reference genomic 
sequence was selected and the positions of its exons (the 
light grey blocks) were detected by aligning its CDS to 
its reference genomic sequence and we have the long 
rectangle intercepted with only the light grey block in 
Figure 1. Then, the genomic sequence of each allele was 
created by replacing the exons of the reference genomic 
sequence with the exon sequences of the allele (the con- 
tinuous dark grep blocks), which were obtained from the 
CDS alignment file locus_mic.txt downloaded from the 
HLA database. Now we have the long rectangle inter- 
cepted by some dark grey blocks (the exons from the 
selected allele) and other light grey blocks (the exons from 
the reference allele that the selected allele misses). The 
missing nucleotides (represented as * in locus_nuc.txt) are 
replaced with nucleotides of another sequenced allele in 
the same positions. Most sequence-based typing methods 
focus on exon 2 and exon 3 for HLA class I loci and exon 
2 alone for HLA class II loci because the regions are most 
polymorphic and encode the peptide-binding groove that 
binds to HLA antigens. Therefore, we further trimmed a 
range of the genomic sequences that contain correspond- 
ing exon(s) as the target sequences (the short rectangle in 
the bottom). There are alleles that are identical over exons 
2 + 3 for HLA class I and exon 2 for HLA class II. To 
avoid ambiguity, we selected one allele from each group 
of alleles with identical target sequences. Table 2 lists the 
reference alleles, the starting positions and lengths of the 
trimmed range on the reference genomic sequences, and 
the numbers of alleles with unique target sequences for 
loci HLA-A, B and DRB1. 

Reads generated from the target sequences 

The CCS reads for the target sequence of an allele were 
produced with PBSIM [21], which is the only simula- 
tor that generates PacBio libraries as far as we know. 
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Table 2 Three HLA loci and their corresponding reference 
alleles 





A 


B 


DRB1 


Reference 


A*01:01:01:01 


B*07:02:01 


DRB1*01:01:01 


Start 


380 


400 


5400 


Length 


1100 


950 


600 


#unique alleles 


2335 


3075 


1388 



We adopted its model-based method and the default set- 
tings for CCS reads (length-mean=450, length-sd=170, 
accuracy-mean=0.98, accuracy-sd=0.02). 

Types of runs 

To estimate the number of reads required to genotype, 
we designed three types of runs, all of which contain the 
same number of reads and different numbers of samples 
(see Table 3). Since the number of barcode pairs for mul- 
tiplex sequencing is 48 [11], the number of samples in a 
run is a factor of 48. In each run, we set the total number 
of reads for a locus (amplicon) as 960 because the num- 
ber of usable CCS reads is about 15 K for 1 Kb insert and 
multiple loci of HLA are sequenced simultaneously. We 
simulated ten groups of samples for each type of runs. 
For each group of samples, we re-generated the reads and 
applied our method ten times. 

Noise reads 

To understand the impacts of barcode-calling errors, in 
each run and for each locus, we created a pool of reads 
that contained the reads of the locus from all the sam- 
ples in the run. Before genotyping a sample, a few number 
of reads were randomly selected from the pool to disturb 
correct reads (i.e. the reads generated from the sample). 
We call such reads as noise reads. Five different levels of 
noise reads (0, 10, 20, 30, 40 noise reads) were added. 
Figure 2 gives an illustration of the simulation process. 

Bayes' theorem for HLA typing 

Given the set of reads assigned to a sample (see the smaller 
rectangles in the rightmost block of Figure 2), we used 
Bayes' theorem to infer the pair of alleles (a p , a q ) of the 
sample. 



Table 3 Three types of runs with the same total number of 
reads: 960 





Type 1 


Type 2 


Type 3 


#correct reads/allele 


40 


20 


10 


#sam pies/group 


12 


24 


48 


#groups 


10 


10 


10 



Denote the reads as r\, r2, . . . , r n and a pair of alleles as 

& h a j' 

p(a h aj)p(r 1 ,...,r n \a h a J ) 
p(ai,aj\ri,...,r„) = — -. (1) 

The probability p (a^aj), which is the probability of a 
random sample having the allele pair (a^aj), depends 
on the population of the sample and we assumed all 
p (au aj) f s are the same when the population is unknown. 
To find the pair of alleles (a p , a q ) that maximize formula 
(1) when the r\, . . . , r n are fixed, it is sufficient to compare 
p{ri,...,r n \auaj). 

Given the alleles that produce the reads, the reads are 
independent of each other. Therefore, 

p(r 1 ,...,r n \a i ,a J ) = \\ p {r k \a it aj) . (2) 

\<k<n 

With the alleles a{ and aj, a read can be produced from 
only one of them. Therefore, we set the probability of a 
read given the pair of alleles to be the the higher prob- 
ability of the read given only one of the pair of alleles. 

p (r k \a u aj) = max {p (r k \ad >P {r k \aj)\ ♦ (3) 
Equations 2 and 3 lead to 

p(n,...,r n \ai,aj) = ]~[ max [p (r k \ad >P {r k \aj)\ • 

\<k<n 

(4) 

Denote the error rate of sequencing as <5, the number of 
matches of the alignment between r k and at as \r k = cii\, 
the number of mismatches as \r k ^ d{\ and the length of 

r /c as \r k \. 

p{r k \ai)= n 5 n ^-^ 

I rk M'l \n=ai\ 



\n<\ \n=ai\ 

Under this definition, p (r k \ai) stands for the probabil- 
ity when a sequence of length \r k \ generated by at equals 
the sequence of r k . The summation of this function for all 
possible sequences of length \r k \ would be 1 and therefore, 
it is a legal probability function. 

Since it remains the same to compare log 
p (ri, . . . ,r n \au aj) instead of p (ri, . . . ,r n \au we 
have 

(a pt a q ) = arg max/? (n, . . . , r n \a if aj) 

= arg max logp (ri, . . . , r n \au aj) 

= arg max ^ max{|r^ = a t \, \r k = aj\}. 

\<k<n 

(6) 
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Figure 2 Simulating pools of reads for a group of samples. Left: simulating the alleles of the samples. Middle: 1 0 iterations of generating the 
reads. Right: mixing reads assigned to correct samples with noise reads. 



Using this method, the number of reads produced by a p 
(or a q ) can be estimated as the number of reads whose 
\r k = a p \ is greater than \r k = a q \. When the num- 
ber of reads of one allele is far less than that of the other 
(e.g. 50%), the sample is regarded as having two identical 
alleles. 

This method works well when the given set of reads are 
all from the alleles of the sample (i.e. the correct reads). 
However, the barcode-calling errors might result in mix- 
ing reads from different samples (i.e., the noise reads). 
Alleles that are close to both the correct reads and the 
noise reads are more likely to be predicted as the answers. 
To deal with the problem, we assumed there are a few 
number of noise reads before selecting the pair (a pt a q ). 

Denote m as the ratio of noise reads assumed. We select 
the pair 

(a p ,a q ) = arg max ^ I (1 - p k ) mzx{\r k = m\, \r k = aj\] 

EPk<nm i< k < n 

+ p k .mxx{\r k = a x \}\ . 

X J 

(7) 

In the equation, p k = 0 means the read r k is a cor- 
rect read and p k = 1 means r k is a noise read. Note that 
equation 7 is the same as equation 6 when m = 0. We 
name the methods based on equation 6 and equation 7 as 
BayesTypingl and BayesTypingl, respectively. 

The value \r k = a\\ can be calculated by aligning the 
read r k and the allele We use the score of the alignment 
instead of the number of matches in our program because 
the score catches more information (e.g. indels). 

Implementation 

Given the set of reads assigned to a sample, we used 
LASTZ [22] to map the reads to the genomic sequence of 
the reference allele (identity =90, coverage =70). For each 
read, we trimmed the regions aligned to introns and 
only reserved those regions aligned to the exons of the 
genomic sequence (see Figure 3-2). The reads that had 



short sequences left (less than 50 bp) were eliminated. 
We then used LASTZ to map the remained reads to 
the coding sequences of all the unique alleles (explained 
in Section 'Methods') separately. We chose a lower gap 
penalty and a low gap extension penalty (100, 20) since 
PacBio reads tend to have more indels. The alignment 
results of all pairs of the remained reads and the unique 
alleles were saved and only the best score for each (read, 
allele) pair was maintained. 

To accelerate, we excluded those unlikely alleles, which 
had less than ten percent of reads satisfying \r k = ai\ = 
max a ii ^ \r k = a x \ because we assume the correct alle- 
les tend to have more number of best matching reads. 
To reduce those impacts of noise reads, we excluded 
the unlikely reads, which had less than 96% of identities 
to all the remained alleles. The pre-processing steps are 
illustrated in Figure 3. 

At last, we used the methods BayesTypingl and 
BayesTypingl to infer the pair of alleles and their cor- 
responding reads. The original sequences of the cor- 
responding reads can further be used to assemble the 
genomic sequences of the alleles. 



1 . Map to the ref allele 3. Pairwise alignments 

j r CDS of reads CDS of alleles 

2. Reserve coding regions 4. Filter reads and alleles 




Figure 3 Pre-processing steps of our methods. 1). Map the reads 
to the reference genomic sequence. 2). For each read, trim its introns 
by reserving the sequences mapping to the coding regions of the 
reference. 3). Align each reserved sequence to the CDS of each allele 
respectively. 4). Filter unlikely reads and alleles. 
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Table 4 Accuracies of BayesTypingO for experiments 
without noise reads 





HLA-A 


HLA-B 


HLA-DRB1 


Type 1 


99.92% 


99.92% 


1 00.00% 


Type 2 


99.50% 


99.21% 


1 00.00% 


Type 3 


97.62% 


96.87% 


99.98% 



The three types of runs are as defined in Table 3. 



Results and discussion 

We compared the first method BayesTypingO with 
NGSengine, which is a platform-independent software for 
NGS data analysis of HLA genes. The number of reads 
for a sample in Type 2 and Type 3 experiments seems too 
few for NGSengine and it could not predict any alleles. 
The Type 1 experiments contain 1,200 sets of reads (12 
samples/group x 10 groups x 10 iterations) and each set 
contains 80 reads (40 correct reads/allele x 2 alleles). We 
regarded a successful prediction when the two predicted 
alleles are both correct. Without inducing noise reads, 
when typing HLA-A, NGSengine could only successfully 
predicted 274 pairs of alleles (22.83%). On the other hand, 
BayesTypingO successfully predicted 1199 pairs of alleles 
(99.92%). NGSengine requires more reads to achieve the 
same accuracy (data not shown). We listed the accuracies 
of BayesTypingO for the three HLA loci and the three types 
of experiments without noise reads in Table 4. 



For experiments with noise reads, we compared our 
methods with a method MaxTwo, which gives the first 
two alleles by comparing the number of reads having the 
maximum alignment scores with them. For all the three 
methods, when the number of reads of one allele is less 
than 50% of the number of the other allele, the sample is 
regarded as having two identical alleles. 

We repeated the experiments by adding different num- 
bers of noise reads. We set the assumed ratio of noise 
reads m for BayesTypingl as 20% of the number of input 
reads (correct reads + noise reads). Figure 4 shows the 
error rates for HLA-A and HLA-B, respectively. For HLA- 
DRB1, all the three methods could identify the correct 
alleles quite well. The accuracy is about 99% even for the 
type 3 experiments with the most number of noise reads 
(20 correct reads and 40 noises reads). 

It could be presumed that the error rates would increase 
as the number of noise reads increases, e.g., when there 
are more barcode-calling errors, or the number of correct 
reads decrease, e.g., when multiplexing more samples. For 
example, for HLA-A, the error rate is more than 10% for 
type 3 experiments when only 10 noise reads are induced. 
BayesTypingl showed the best capability to tolerate the 
disturbance of the noise reads. Even when there were 
no noise reads, which conflicted with the assumption of 
BayesTypingl, BayesTypingl also performed well. One the 
other hand, BayesTypingO usually performed best when 
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Figure 4 Experiments results of HLA-A and HLA-B genotyping. We plotted the results of the three types of runs for HLA-A and HLA-B 
genotyping in (a)-(c) and (d)-(f) respectively, using three different methods. The x-axis stands for the numbers of added noise reads and y-axis 
stands for the error rates. 
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0(0%) 10(20%) 20(33%) 30(43%) 40(50%) 50(56%) 

The number of induced noise reads (ratio%) 

Figure 5 The change of error rates with different levels of noise reads. Experiments of Type 2 HLA-A using BayesTypingl with different levels 
of m, which is the ratio of noise reads assumed. 



there are no noise reads, but it suffered as the number 
of noise reads increased. It even performed worse than 
MaxTwo when the noise reads outnumbered the correct 
reads. 

The difference error rates between the three loci might 
reflect the characteristics of the sequences of alleles cur- 
rently gathered. Although the number of HLA-B alleles 
is more than that of HLA-A, the HLA-B sequences seem 
more distinguishable because the error rates of typing 
were lower. It seems that HLA-DRB1 has the best distin- 
guishable alleles. 

To study the effect of the parameter m for BayesTyp- 
ingl, we set m as different percentages of the number of 
the input reads and ran BayesTypingl repeatedly. The data 
we used are Type 2 HLA-A experiments and more noise 
reads were added. We plotted the results in Figure 5. 

It showed that a higher level of m worked better when 
noise reads were added, i.e., given any vertical line, a 
higher level of m has a fewer error rate. The differ- 
ence of error rates also becomes larger as the number 
of noise reads increases. However, the error rates would 
converge at some point and a larger m would make lit- 
tle effect, i.e., given any vertical line, the difference of 
error rates shrinks as the level of m increases. When 
m is too large and there are only few numbers of noise 
reads, BayesTypingl will perform worse than BayesTyp- 
ingO (data not shown). Theoretically, all pairs of alleles 
have the same probability when m is 100% of the input 
reads. 

Ambiguous allele combinations 

For data with read length not long enough (such as 450 
bp in the simulation), there are still ambiguous allele 



combination problems to type class I HLA. Assuming 
there are four alleles with the following patterns of exons, 

allele a : exon 2a + exon 3a 

allele b : exon 2b + exon 3b 

allele c : exon 2a + exon 3b 

allele d : exon 2b + exon 3a. 

When the read length is not long enough to cover the 
region of exon 2 + intron + exon 3, for the samples with 
allele a and allele b (or allele c and allele d), there is no way 
to distinguish which combination of alleles is correct. 

To address this problem, we enumerated such ambigu- 
ous pairs of alleles for HLA-A and randomly selected 24 
pairs for the samples in a Type 2 run. The 24 ambigu- 
ous pairs of alleles were listed in Additional file 1: Table 
S2. Except that the average read length was 1 Kb, other 
steps and parameters to generate the reads were exactly 
the same as described in Simulation. To make a contract, 
we also generated reads with average length 450 bp and 
doubled the number of reads for each allele (from 20 to 
40) to reach similar depth of coverage. As in Experiments, 

Table 5 The results of BayesTypingl on 1 Kb and 450 bp 
reads for 24 samples with ambiguous pairs of alleles 

40 

240 
1 

212 
5 



#Noise reads 


0 


10 


20 


30 


1 Kb (correct) 


240 


240 


240 


240 


1 Kb (ambiguous) 


0 


0 


0 


1 


450 bp (correct) 


234 


231 


235 


229 


450 bp (ambiguous) 


132 


69 


34 


7 



Chang etal. BMC Bioinformatics 2014, 15:296 
http://www.biomedcentral.eom/1 471 -21 05/1 5/296 



Page 8 of 10 



15 



o 

LLJ 10 



12 samples 
24 samples 
48 samples 




0 10 20 30 40 

The number of induced noise reads 

Figure 6 The influence of the diversity of the read pools. Experiments of HLA-A with noises from different numbers of samples using 
BayesTypingl. 



we also re-generated both types of reads ten times for 
this group of samples. Different levels of noise reads were 
induced and we applied BayesTypingl to genotype. The 
results are summarized in Table 5. In addition to the 
number of times that BayesTypingl correctly assigned 
the pairs of alleles, we also listed the number of times 
when BayesTypingl had more than one answers (the cor- 
rect answer was included). 

It showed that using 1 Kb reads, BayesTypingl could 
correctly assign the pairs of alleles without ambiguity in 
most cases, even when the number of the noise reads 
equalled the number of correct reads. On the other hand, 
using 450 bp reads, BayesTypingl could also achieve good 
accuracies. This might be due to the variation of PacBio 
read length. With higher depth of coverage, there are still 
a few reads that are long enough to cover exon 2 and exon 
3. It is surprising that BayesTypingl caused much more 
ambiguities when the number of noise reads was fewer. It 
might be because BayesTypingl tends to treat longer reads 
as noise reads when there are no noise reads in fact. 

Diversity of noise reads 

The source of noise reads might also affect the error rates 
of typing. In the worst case, when the noise reads are all 
from an allele of another sample, it is more likely to iden- 
tify the wrong allele. When the noise reads are diverse, the 
corrected alleles might be easier to stand out. 

To compare the impact of the diversity of the noise 
reads, we mixed noise reads from read pools that con- 
tained different number of samples, i.e. 12, 24 and 48, 
respectively, with correct reads. Other parameters are the 
same as Type 2 experiments. The experimental results 
using HLA-A were shown in Figure 6. 



The error rates of these three experiments showed not 
much difference. The number of correct reads, the num- 
ber of noise reads and the difference between the two 
numbers play a much more important role. 

Homozygous and heterozygous samples 

As mentioned in Section 'Methods', for each loci, we sim- 
ulated 30% of homozygous samples. To test whether the 
accuracies of homozygous samples and those of heterozy- 
gous samples are significantly different or not, we utilized 
Fishers exact test [23]. We summed the numbers of cor- 
rect and wrong predictions for homozygous and heterozy- 
gous samples of the HLA-A and HLA-B experiments. 
(HLA-DRB1 experiments were excluded since most of the 
predictions are correct.) The contingency table of the four 
number were expressed in Table 6. The tables of the three 
types of runs can be found in Additional file 1: Table S3- 
S5. For each type of experiments, the total sum should be 
{(#loci) x (#samples) x (#groups of samples) x (itera- 
tions) x (#levels of noise reads)} (2 x 24 x 10 x 10 x 5 for 
the type 2 experiments). 

We applied Fishers exact test for the contingency 
tables of the three types of experiments and listed the 
p-values and odd ratios in Table 7. The odds ratio 
is calculated by {(#correct for homozygous)/(#error for 



Table 6 Four numbers for Fisher's exact test of type 2 
experiments 





Homozygous 


Heterozygous 


Correct 


6720 


15814 


Error 


330 


1136 
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Table 7 Results of Fisher's exact test 





Type 1 


Type 2 


Type 3 


#reads/allele 


40 


20 


10 


Odds ratio 


0.66 


1.46 


1.47 


p-value 


0.13 


1.2 x 10~ 9 


2.2 x 10~ 16 



homozygous)}/{(#correct for heterozygous)/(#error for 
heterozygous)}. It showed homozygous samples had more 
advantages over heterozygous samples when the num- 
ber of correct reads were fewer. It might be because the 
number of correct reads for the same allele doubled for 
homozygous samples, which made the correct allele stand 
out. 

Conclusions 

The experimental results showed that BayesTypingl can 
identify HLA alleles accurately using reasonably low num- 
ber of PacBio CCS reads. BayesTypingl can tolerate 
sequencing errors, which are introduced by the PacBio 
sequencing technology, and noise reads, which are intro- 
duced by barcode-calling errors, to some degree. The 
three types of experiments suggest it is better to multi- 
plex 12 or 24 samples instead of 48 samples to maintain 
a high accuracy, since the number of reads for each sam- 
ple in a 48-sample example might be too few for HLA 
typing. 

Additional file 



Additional file 1 : It contains the table listing the frequencies of the 
2-digits alleles of the three loci: HLA- A, HLA-B and HLA-DRB1 . It also 
contains the twenty-four pairs of ambiguous alleles we used in our 
experiment and the contingency tables for Fish's exact test. 
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